Dynamic sub-route-based self-adaptive beam search Q-learning algorithm for traveling salesman problem

In this paper, a dynamic sub-route-based self-adaptive beam search Q-learning (DSRABSQL) algorithm is proposed that provides a reinforcement learning (RL) framework combined with local search to solve the traveling salesman problem (TSP). DSRABSQL builds upon the Q-learning (QL) algorithm. Considering its problems of slow convergence and low accuracy, four strategies within the QL framework are designed first: the weighting function-based reward matrix, the power function-based initial Q-table, a self-adaptive ε-beam search strategy, and a new Q-value update formula. Then, a self-adaptive beam search Q-learning (ABSQL) algorithm is designed. To solve the problem that the sub-route is not fully optimized in the ABSQL algorithm, a dynamic sub-route optimization strategy is introduced outside the QL framework, and then the DSRABSQL algorithm is designed. Experiments are conducted to compare QL, ABSQL, DSRABSQL, our previously proposed variable neighborhood discrete whale optimization algorithm, and two advanced reinforcement learning algorithms. The experimental results show that DSRABSQL significantly outperforms the other algorithms. In addition, two groups of algorithms are designed based on the QL and DSRABSQL algorithms to test the effectiveness of the five strategies. From the experimental results, it can be found that the dynamic sub-route optimization strategy and self-adaptive ε-beam search strategy contribute the most for small-, medium-, and large-scale instances. At the same time, collaboration exists between the four strategies within the QL framework, which increases with the expansion of the instance scale.


Introduction
For a set of given cities, the traveling salesman problem (TSP) is finding the shortest route along which a salesman visits all of the cities exactly once before returning to the starting point. The TSP is a well-known combinatorial optimization problem with applications in many fields [1], such as transportation, circuit board design, production scheduling, and logistics distribution.
As a traditional NP-hard problem, numerous approaches have been proposed to solve the TSP, most of which use exact and heuristic algorithms. Exact algorithms include branch-and- bound (BnB), cut-plane, integer programming, and dynamic programming, all of which are used to obtain the global optimal solution by continuous iteration. For example, Pekny et al. (1990) [2] and Pesant et al. (1998) [3] proposed the BnB method and its variants for the TSP problem, and the famous TSP solver Corconde (http://www.math.uwaterloo.ca/tsp/concorde/ index.html) is based on the BnB algorithm. Sanches et al. (2017) [4] proposed a partitioned cross-improvement initial solution method to speed up Corconde. However, since the time cost of the exact algorithm increases exponentially with the size of the instance, they are not suitable for large-scale applications. Heuristic algorithms are widely used because of their high computational efficiency, and they can obtain a sub-optimal solution in a reasonable timeframe. Representative heuristic algorithms include the Lin-Kernighan heuristic (LKH), ant colony optimization (ACO) algorithm, genetic algorithm (GA), particle swarm optimization (PSO) algorithm, whale optimization algorithm (WOA), and gray wolf optimization (GWO). Based on this, various improved algorithms have been studied for solving the TSP problem. For example, by modifying the heuristic rules of LKH to improve its search strategy, Helsgaun (2000) [5] proposed an improved LKH algorithm. Ebadinezhad et al. (2020) [6] proposed an ACO algorithm that included a dynamic evaporation strategy (DEACO). Wang et al. (2022) [7] proposed a fine-grained fast parallel GA algorithm based on a ternary optical computer, and Zheng et al. (2022) [8] proposed a transfer learning-based PSO algorithm.   [9] proposed a variable neighborhood discrete WOA algorithm (VDWOA), while Panwar et al. (2021) [10] proposed a novel discrete GWO algorithm. However, heuristic algorithms are mostly based on random search, which lacks both an ability to learning and a theoretical foundation. They also usually require unique heuristic rules for a given problem. For this reason, designing unified solutions for combinatorial optimization problems such as TSP has become a popular research topic in machine learning.
As one of the main machine-learning methods, reinforcement learning (RL) has strong decision making and autonomous learning capabilities. RL is based on the Markov decision process (MDP), which is a sequential decision mathematical model that has natural similarity to the TSP. Therefore, many recent algorithms have adopted RL for TSP, and they can be divided into three categories.
1. Deep learning algorithm combined with RL. The deep learning (DL) algorithm combined with RL exploits the perception ability of DL and the decision-making ability of RL. Its fast solution speed and strong generalization ability give this combination great potential in finding approximate TSP solutions. For example, Vinyals et al. (2015) [11] used supervised learning for training in a pointer network for TSP. Bello et al. (2016) [12] used an actorcritic policy gradient method in a recurrent neural network. Dai et al. (2017) [13] proposed an S2V-DQN algorithm using a graph embedding network trained by deep Q-learning. Deudon et al. (2018) [14] trained a neural network for TSP by policy gradient using the reinforcement learning rule with a critic. Ma et al. (2019) [15] used hierarchical RL for training in a graph pointer network for TSP. Stohy et al. (2021) [16] used an actor-critic method for training in a hybrid pointer network. These algorithms use RL to train neural networks by testing a large number of small-scale TSP datasets, which consumes a great deal of time and resources, and the accuracy of the solution is not ideal. The essence of this type of algorithm is deep learning, using RL only as a training method.
2. Heuristic algorithm combined with RL. The heuristic algorithm combined with RL uses the strong generalization ability of RL to improve the heuristic algorithm to obtain better and faster solutions. For example, Liu et al. (2009) [17] proposed an improved GA algorithm with reinforcement mutation, using the genetic algorithm as the framework and reinforcement learning as a mutation operator, making it essentially a heuristic algorithm. Alipour et al. (2018) [18] proposed a hybridization of GAs and a multi-agent reinforcement learning (MARL) heuristic, which uses GA as a tour improvement heuristic and MARL as a construction heuristic. Costa et al. (2020) [19] proposed a 2-opt heuristic algorithm combined with deep RL for TSP, which essentially enhances the learning process of 2-opt. Combining the advantages of Q-learning (QL), Sarsa, and the Monte Carlo algorithm, Zheng et al. (2020) [20] proposed a variable strategy reinforced (VSR) approach, optimized the kopt process of LKH based on this, and designed VSR-LKH for TSP. Optimizing the parameters of the biased random-key genetic algorithm (BRKGA) by QL, Chaves et al. (2021) [21] proposed a BRKGA-QL algorithm for TSP. These algorithms are still heuristic, using RL only to optimize parameters or strategies.
3. RL algorithm. The above two categories of algorithms are not RL; they use RL for training or optimization. Algorithms based on RL are less common. For example, Ottoni et al. (2018) [22] proposed two RL algorithms for TSP, the QL and Sarsa (state action reward state action) algorithms, where the RL parameter estimation is based on a response surface model (RSM). The algorithm only optimizes the parameters of RL, and does not improve the algorithm framework, so the results are worse than the above two types of algorithms. For this reason, we design the improved QL-based algorithms in this paper to find a better QL framework to increase the efficiency of this kind of algorithm and provide a new way to solve the TSP. The principal contributions are as follows: We introduce an improved QL algorithm for solving the TSP. A new reward matrix is constructed that is able to evaluate the immediate reward of the action more accurately compared to the original reward matrix. To reduce the size of the initial search space, a new initial Q-table is constructed. To balance exploration and exploitation, a new action selection strategy is designed. To more accurately assess the actions of the agent, a new Q-value update formula is designed. A local search strategy is also designed to enhance the algorithm's local optimization capability.
The remainder of this paper is organized as follows. Section 2 presents the basic theoretical concept of RL, and a QL algorithm for the TSP problem is constructed. After analysis of the algorithm, some improvements are examined. In section 3.1, based on four proposed improvement strategies, an adaptive beam search QL algorithm is constructed, and a dynamic subroute-based adaptive beam search QL algorithm is studied in section 3.2 to further improve performance of the algorithm. Section 4 discusses our experiments. Section 5 presents conclusions and suggested future work.

Theoretical foundation
RL is an important machine-learning method proposed by Minsky in 1956. Its model is based on MDP, which shows significant advantage and potential in sequential decision problems. Its powerful exploration and autonomous learning abilities [23] are widely used in game-playing [24][25][26], robotic control [27,28], and transportation [29][30][31].
As a methodological framework for learning, prediction, and decision-making, the agent learns the optimal policy by interacting with the environment [32]. Based on different ways of action selection, RL can be categorized as value-based, policy-based, or actor-critic learning. QL and Sarsa are two representative value-based RL algorithms. Considering that the valuebased algorithm is applicable to a discrete action space, QL requires fewer parameters and has a better exploration ability [33], and is likely to find a better solution than Sarsa; thus, we focus on QL algorithms for TSP.
As a model-free RL algorithm, in QL, the agent continually interacts with the environment to obtain rewards, and the action-selection strategy can be evaluated and optimized to maximize the cumulative reward [34][35][36].
The QL algorithm can be described as an MDP, which is usually represented by a fivetuple, <S, A, P, R, γ> [37]. S is the state of the environment, s (s2S) represents a certain state, and the environment is the learning object in which the agent performs actions to change its state. A is the action set of the agent, and a (a2A) represents a certain action. As a learner, the agent tries all possible actions to cognize the environment to explore the proper actions. P is a matrix of state transition probabilities, and p(s,a)2P is the probability that the agent performs action a under state s, and the selection of action is determined by policy π. R is the reward matrix, and r(s,a)2R is the immediate reward obtained by the agent after performing action a in status s. γ is the discount factor that determines the ratio between the immediate and cumulative rewards, whose higher value indicates a greater emphasis on the latter.
QL utilizes the Q-table and reward matrix R to respectively evaluate the cumulative and immediate rewards of actions. The Q-value is updated as follows. Given state of the environment s t of step t, the agent selects action a t under the ε-greedy strategy to affect the environment so that the state is transferred to s t+1 , and an immediate reward r(s t ,a t ) is obtained. Then, the Q-value is updated as Formula (1).
where q(s t ,a t ) is the cumulative reward of taking action a t in state s t ; max a tþ1 2A qðs tþ1 ; a tþ1 Þ is the maximum cumulative reward that can be obtained by taking action a t+1 in state s t+1 ; α is a learning factor, α2[0,1]; and γ is a discount factor, γ2[0,1].
The ε-greedy strategy guides the agent to choose the action with the largest Q-value with probability ε and to randomly select the other action with probability 1−ε.
Initialize the maximum number of iterations I max ; 6.
while (current number of iterations< I max ) 7.
Initialize t and s t ; 8.
While (s t is not the terminal state) 9.
Select action a t according to ε-greedy policy in state s t ; 10.
Calculate the immediate reward r(s t ,a t ); 11.
Obtain optimal policy according to Q-

Problem mapping of QL algorithm for TSP
Solving the TSP using the QL algorithm requires the construction of five-tuples, with the following parameters.
n: total number of cities; t: number of visited cities, 1�t�n; S t : set of visited cities after visiting t cities; s t : number of t-th visited city, s t 2S t ; A t : set of unvisited cities after visiting t cities; a t : number of next city to be visited after city s t , a t 2A t ; i: number of current city, 1�i�n, which is used as the row value of the matrix P, R, Q, D; j: number of next city to be visited from current city, 1�j�n, which is used as the column value of the matrix P, R, Q, D; I c : current iteration number; I max : maximum iteration number; P: action selection probability matrix, where d(i,j) is the distance from city i to city j.

Construction of proposed QL algorithm
The process of the algorithm for solving the TSP can be described as follows. Starting from city s t (the initial value of t is 1), select city a t from A t and visit it according to the ε-greedy strategy, determine the immediate and cumulative rewards from s t to a t . Repeat this process until all cities are visited.
(1) Construct reward matrix R, where each element is calculated as (2) Construct initial Q- Formula (4) shows that we start from city s t and visit city a t with a certain probability. The city with the largest Q-value has a probability of ε to be visited, while the other (n−t−1) cities have a probability of (1−ε) /(n−t−1) to be visited.
The pseudocode of the QL algorithm for the TSP is shown as Algorithm 2. while (I c < I max ) 8.
Randomly select a city to be the initial one and denoted it as s 1 ; 9.
Determine the next city a t to be visited from s t according to Formula (4); 14.
Calculate the immediate reward r(s t ,a t ) from city s t to a t according to Formula (2); 15.
Update the cumulative reward q(s t ,a t ) from city s t to a t according to Formula (1); 16.
Calculate and save the length of n cities access sequence in set S n ; 22.
The route corresponding to S n with the shortest route length will be the optimal one. 25.End

Analysis of results
We selected 18 instances of 30-1655 cities in TSPLIB to test the proposed QL algorithm. The parameters α, γ, and ε of the algorithm were 0.1, 0.1, and 0.95, respectively. The results are compared with those of the VDWOA and DWOA algorithms [9] in the same environment. As only 12 instances are listed in [9], results of the six other instances were tested by VDWOA and DWOA; the details are listed in Table 2.
The results are as follows. For instances with fewer than 500 cities, the QL algorithm is 2-40 times faster than the comparative algorithms; for instances with 500-1000 cities, it is 4-60 times faster, and for instances with more than 1000 cities, it is 5-40 times faster. The QL algorithm has distinct advantages in solution time, but the accuracy is relatively poor.
The proposed QL algorithm (Algorithm 2) is only a simple simulation of the traditional QL algorithm (Algorithm 1), which does not consider the characteristics of TSP and the defects of the QL algorithm itself. There are four reasons for its poor accuracy: 1. Formula (3) for calculating elements of reward matrix R only considers the distance between the current city i and city j, the next to be visited, without considering the impact of the distance between city j and subsequent cities. This is too greedy, and can lead to a high probability of selecting the local optimum city.
2. The Q-table is simply initialized with 0, giving all cities the same initial Q-value, so that the algorithm searches nearly blindly in the early stage, resulting in a large search range and decreasing the convergence speed. Due to the great randomness of city selection in the early stage, the probability of worse cities being selected may be increased, causing the algorithm to easily fall into local optima.
3. Cities are selected based on a ε-greedy strategy. If ε is too large, the selection probability of the global optimal city will be reduced, and the search scope smaller, causing premature convergence. If ε is too small, the probability of selecting a globally worse city will increase, and the search scope of the algorithm will be larger, causing difficult convergence.
4. There is no new city to visit after the last city, and all that remains is to return to the starting city. Thus, there is no need to consider the cumulative reward, and Formula (1) is no longer suitable for updating the Q-value.

QL-based algorithms for TSP
We studied several methods to improve the QL algorithm based on the above analysis.

Self-adaptive beam search QL algorithm
We propose a self-adaptive beam search QL algorithm, ABSQL. Four strategies are introduced: the weighting function-based reward matrix, power function-based initial Q-table, self-adaptive ε-beam search strategy, and a new formula for updating the Q-value.

Weighting function-based reward matrix.
For reason (1) above, we draw on the idea of the finite discount cumulative reward strategy [38]. Based on Formula (2), we introduce d min and d avg to consider the effect of the distance from city j to subsequent cities from local and global dimensions, respectively, and compute where θ 1 and θ 2 control the influence degree of city i on city j and its successor, respectively, θ 1 2(0.5,1), θ 2 2(0.5,1); and d min and d avg are the minimum and average distances, respectively, from city j to the other cities to be visited. The above strategy (denoted as stgy_1) synthetically considers the minimum and average distance between the current city to be visited and subsequent cities. stgy_1 can mitigate the greediness of the original reward matrix and increase the accuracy of rewards to find a higherquality solution.

Power function-based initial Q-table.
For reason (2) above, since the distances between cities will affect city selection, the reward matrix constructed according to Formula (5) will positively affect the selection of cities. Therefore, we introduce a power function to cause positive correlation between the initial Q-table and the improved reward matrix. The improved initial Q-table is constructed as where δ is a parameter to control the correlation between q(i,j) and r´(i,j), δ2(1.0,1.2).
The improved initial Q-table (denoted as stgy_2) can reduce the search scope and enhance the search ability in the early stage, enabling faster convergence and a better-quality solution.
3.1.3. Self-adaptive ε-beam search strategy. For reason (3) above, considering that the beam search algorithm has the characteristics of fast operation and a tendency to avoid local optima, a self-adaptive ε-beam search strategy is designed. By improving the fixed ε to be selfadaptive and replacing the greedy search with a beam search, the probability of action selection can be properly controlled.
(1) Self-adaptive ε strategy In the early stage of the algorithm, the search behavior should be relatively random and the search scope should be large. In this stage, the algorithm focuses on exploration to avoid falling into local optima. In the later stage, the search scope should be small, and the algorithm turns to exploitation, while the convergence speed is gradually accelerated. Exploration and exploitation can be controlled by adjusting the value of ε. With a larger value, the algorithm tends more toward exploitation. Therefore, the value of ε should increase with iteration. The improved calculation is where b and c are parameters to control the respective initial and final values of ε.
(2) Beam search strategy Beam search is a heuristic graph search algorithm [39][40][41] that will gradually cut off some poor-quality nodes and reserve some high-quality nodes. It is usually adopted when the solution space of the graph is relatively large, to decrease the space and time cost of the search.
Based on a beam search, we sort Q-values of unvisited cities in descending order, and then select the top several as the next candidates to be visited. Considering the time cost and complexity of the algorithm, the number of candidate cities is set to 2, i.e., the current optimal or the second optimal city is selected as the next visited city. Compared with the greedy strategy, which only keeps the current optimal city, it has a larger search scope, which can increase the possibility of selecting the global optimal city, and makes the algorithm less likely to fall into local optima.
In summary, the self-adaptive ε-beam search strategy (denoted as stgy_3) can be described as where formula (a) indicates that starting from city s t , the city with the maximum Q-value is selected with probability ε´, formula (b) indicates that the city with the sub-maximum Q-value is selected with probability (1-ε´) � ε´, and formula (c) indicates that other cities are selected with probability (1-ε´) 2 /(n-t-2) in A t .
The improved strategy can balance exploration and exploitation, expand the search scope in the early stage, and enhance the search ability in the middle and later stages.
3.1.4. Improvement of Q-value update formula. For reason (4) above, the Q-value of the last visited city does not contain long-term rewards, so the value calculated by using Formula (1) that contains long-term rewards is inevitably higher than its real Q-value. Furthermore, the influence of such overestimation will gradually accumulate with iteration, making the estimation of the Q-table more inaccurate, affecting the judgment of city selection. Therefore, Formula (1) is no longer suitable for updating the Q-value of the last visited city.
The last visited city has no other city to visit and needs only to return to the starting city; hence the long-term reward in Formula (1) should not be considered. The improved Q-value update formula is while (I c < I max ) 7.
Select one city randomly to be the starting one and denoted it as s 1 ; 9.
Determine the next city a t to be visited from s t according to Formula (8); 12.
Calculate the immediate return value r´(s t ,a t ) according to Formula (5); 13. if (t! = n) 14.
Update the cumulative reward q´(s t ,a t ) from city s t to a t according to Formula (9a); 15. else if (t = = n) 16.
Update the cumulative reward q´(s t ,a t ) from city s t to a t according to Formula (9b); 17.
Calculate and save the length of n cities access sequence in set S n ; 24.
The route corresponding to S n with the shortest route length will be the optimal one.

Analysis of results.
We used the 18 instances in Section 2.4 to test the effectiveness of the proposed ABSQL algorithm through analysis of the length and graph of the route. The parameters α and γ used in the ABSQL algorithm are the same as in the QL algorithm. The newly added parameters θ 1 , θ 2 , δ, b and c are 0.8, 0.8, 1.12, 0.25, and 1.05, respectively.
(1) Route length analysis We compare solutions obtained by the ABSQL algorithm with those of QL and VDWOA, with results as shown in Table 2. Denote min, avg, and max to be the minimum, average, and maximum values, respectively, with deviation rates where opt is the known optimal solution. The comparison of ABSQL and QL shows that ABSQL has a faster solution, the quality of which is greatly improved.
For the 18 instances, min of ABSQL is better than that of QL, and its S min is 0-92% less than that of QL. Comparing ABSQL and VDWOA, in terms of time, ABSQL exceeds VDWOA in all instances. In terms of solution quality, the following results are obtained.
For 14 small-scale instances with the total number of cities less than 500, the min of ABSQL is slightly worse than that of VDWOA for eight instances, and its S min is 0.08%-5.7% greater. However, the avg of ABSQL has clear advantages over VDWOA in 10 instances, and its S avg is 0.31%-7.25% less. For four medium-and large-scale instances with more than 500 cities, the min of ABSQL is better than that of VDWOA, and its S min is 0.67%-5.11% less. Since VDWOA takes too long for medium-and large-scale instances, it cannot obtain enough solutions in a limited time to calculate avg. Therefore, the avg and S avg of the two algorithms are not compared.
The above analysis shows that ABSQL has a better ability than VDWOA for medium-and large-scale instances. However, as comparing opt values reveals, ABSQL still has some deficiencies. For small-scale instances, the S min of ABSQL is 0-10.95%, and for medium-and large-scale instances, the S min of ABSQL is 12.34%-14.31%, i.e., the S min of ABSQL increases with the scale of the problem, so the solution accuracy of ABSQL requires further improvement.
(2) Route diagram analysis Analyzing the route diagrams obtained from the above instances, a common problem is found: one or more sub-routes are not completely optimized. We select instance bays29 to analyze this problem from two aspects.

2) No cross or overlap between edges
In Fig 1, there is no crossover or overlap between edges of sub-route 23-8-24-27-16. However, the sub-route is not completely optimized. This is mainly because the sum of the lengths of edges (23,27) and (27,8) is less than that of (16,27) and (27,24). Sub-route 23-27-8-24-16 of Through the above analysis, it can be seen that the incomplete optimization of sub-routes is the main reason for the relatively poor solution of the ABSQL algorithm compared with the known optimum. Considering that the route length can be shortened by swapping some nodes of the sub-route, we introduce the 2-opt strategy to further optimize the sub-route.

Dynamic sub-route-based ABSQL algorithm
To solve the problem of incomplete optimization of sub-routes, static and dynamic sub-route optimization strategies are designed to optimize routes output by ABSQL.

Static sub-route optimization.
In this section, we perform following operations. Denote the route output by ABSQL as S n . A partition operation is performed first on S n , as follows. Starting from the first node, successively divide S n (which has n nodes) into bn/mc subroutes (each with m nodes), and denote these as sr 1 , sr 2 . . . sr bn/mc . If there is a sub-route with fewer than m nodes, it is denoted as rp, and the set of all sub-routes is Sr = {sr 1 , sr 2 , . . ., sr bn/mc , rp}. Because the number of nodes in each sub-route is fixed, the partition process is called static sub-route optimization. For sub-routes with four or more nodes in Sr, all nodes except the first and last are selected to perform 2-opt optimization, and the optimized set is denoted as Sr 1 . The optimization process is as follows.
Step 1: Select the second node in the sub-route as the starting node, as well as other nodes for (m−2) � (m−3)/2 instances of 2-opt optimization; Step 2: If the length of the sub-route does not decrease after 2-opt optimization, the optimization of the sub-route is stopped; otherwise, the new sub-route is reserved and we return to step 1.

Dynamic sub-route optimization.
Since static sub-route optimization does not consider optimization of the first and last nodes of a sub-route, and the number of nodes on each sub-route is fixed, a dynamic optimization strategy is introduced in which the first and last nodes are considered at the same time. Dynamic optimization is performed based on Sr 1 obtained by static optimization, so that the number of nodes on each sub-route can change over the optimization process, which is as follows.
Step 1: Suppose the number of current optimization rounds is h, with initial value 1, h�L. Step 2: All sub-routes in set Sr h except for rp h are merged in pairs in order, and if sr f(h)-1 h is left, it is merged with rp h ; Step 3: For each sub-route obtained in step 2, select nodes at 1/4 to 3/4 positions of the node sequence, and perform 2-opt optimization to obtain set Sr h+1 ; Step 4: If h�L, then let h = h+1, and return to step 2; otherwise, merge all sub-routes in Sr L +1 to obtain the new path.
By combining ABSQL with the dynamic sub-route optimization strategy (denoted as stgy_5), a dynamic sub-route-based self-adaptive beam search Q-learning algorithm (DSRABSQL) for TSP is designed.
The pseudocode of the above algorithm is shown as Algorithm 4. S n obtains Sr according to the division operation in Section 3.4.1 6.
Optimize Sr according to step 1 and step 2 in Section 3.4.1 to get Sr 1 7.
Merge all sub-routes in Sr L+1 in order to obtain an optimal route. 14.End

Analysis of results.
We used the 18 instances in Section 2.4 to test the proposed DSRABSQL algorithm. The parameters α, γ, θ 1 , θ 2 , δ, b and c used in the DSRABSQL algorithm are the same as in the ABSQL algorithm, with the added parameter m of 10. The performance of DSRABSQL is analyzed by comparing its optimal route diagrams with those of ABSQL.
Since most instances have too many cities or an uneven distribution, the route diagrams are not clear. We list the route diagrams of berlin52 and tsp225 for illustration.

Experiments and analysis of results
The simulation is carried out in Spyder and implemented on an Intel Core i5-7500 CPU at 3.4 GHz, with 8 GB memory and a Windows 10 (64-bit) operating system. We take 18 TSP instances and test them with city's scale from 30 to 1655. The instances that used for this experiment is taken from TSPLIB (http://comopt.ifi.uni-heidelberg.de/software/TSPLIB95/) library. Table 1 lists the simulation parameters of the three algorithms to elaborate on the characteristics of the parameters adopted by each algorithm.
The values of the parameters listed in Table 1 were obtained through experimental tests, and the specific procedures are as follows.  2. The parameters α and γ in stgy_4 inherit the optimal values of α and γ from the QL algorithm.

Procedure 3:
Determine parameter m of the DSRABSQL algorithm. Set the initial value, final value and step size of parameter m in stgy_5, and select the optimal value of these parameters based on the test results.
The QL, ABSQL and DSRABSQL algorithms are iterated for I max times, respectively. The results are taken in n t simulation runs for each instance. I max and n t are calculated as  Table 2 lists min (S min ), avg (S avg ), max (S max ), S td , and the average time cost of the 18 instances tested by QL, ABSQL, and DSRABSQL, and the corresponding values of VDWOA and DWOA.  From Table 2, it can be seen that min, avg, max, and S td of DSRABSQL are mostly better than those of VDWOA for 18 instances, which indicates that the solution quality and stability of DSRABSQL are superior to those of VDWOA.
To further analyze the performance of the algorithm, the data of Table 2 are discussed in two aspects: the first is the data comparison of QL, ABSQL and DSRABSQL, and the second is the data comparison between DSRABSQL and VDWOA.
For the first aspect, a small-(ch150), medium-(rat575), and large-scale (fl400) instance are selected to analyze the performance of DSRABSQL. Figs 7-9 illustrate min, avg, max, S td , and N min (the number of mins) obtained by using QL, ABSQL, and DSRABSQL, respectively, to test ch150, rat575, and fl1400 under different numbers of iterations. Since the time cost of the algorithm will increase with scale of the instance, ch150, rat575, and fl1400 are tested 50, 20, and 10 times, respectively.
A chi-square test is performed on N min as obtained for ch150 under 400 and 500 iterations of DSRABSQL, as shown in Fig 10, where N min is the number of mins, and I t is the number of iterations. With 1 degree of freedom, we assume Hypothesis 0: N min and I t are correlated.
For this purpose, the Sample Frequency Contingency Table of N min and I t , as shown in Table 3, must be made, and a chi-square test based on the fourfold data in Table 3 is run, with the test statistic χ 2 calculated as 6.895.
From Table 2 and Figs 7~10, it can be seen that min, avg, max, S td , and N min of DSRABSQL are better than those of QL and ABSQL, which indicates that the solution quality, stability, and convergence of DSRABSQL are all significantly improved compared with QL and ABSQL throughout the iterative process. Table 3 shows that χ 2 2(6.635, 7.879), and the probability p that Hypothesis 0 is true is between 0.99 and 0.995, which indicates that N min and I t are significantly correlated. Similarly, N min and I t of instances rat575 and fl1400 can also be inferred to

PLOS ONE
be correlated. Therefore, N min of DSRABSQL increases with I t until achieving stability, i.e., the probability of DSRABSQL for getting its minimum value min will increase with I t until stability. The above analysis shows that DSRABSQL has learning ability and can be continuously enhanced with iterations until stability.
For the second aspect, the Mann-Whitney-Wilcoxon test is used to verify the performance of DSRABSQL relative to VDWOA based on the data of Table 2. A parameter, is introduced to evaluate the performance of the algorithm under different scale instances. Since S min , S avg , and S max have different effects on performance of the algorithm, we assign them different weights, in this case 0.5, 0.4, and 0.1, respectively. The smaller S c is, the better the comprehensive performance of the algorithm is. The values of S c for 18 instances of DSRABSQL and VDWOA are calculated, with results as shown in Tables 3 and 4. To verify that DSRABSQL shows more advantages in solving medium-and large-scale instances, the 18 instances are divided into groups, G 1 and G 2 . Group G 1 includes 14 small-scale instances, and G 2 includes 4 medium-and large-scale instances. In group G 1 , S c of DSRABSQL is marked as Sample 1, and that of VDWOA as Sample 2; for group G 2 , S c of DSRABSQL is marked as Sample 3, and that of VDWOA as Sample 4. Then, the following hypotheses are made: Hypothesis 1(a): S c of DSRABSQL in G 1 and G 2 is not significantly different from that of VDWOA; Hypothesis 1(b): S c of DSRABSQL in G 1 and G 2 differ significantly from that of VDWOA.  Then, S c values of Samples 1 and 2 are sorted in ascending order. R 1 and R 2 are the respective rank sums of Samples 1 and 2. The Mann-Whitney-Wilcoxon test is performed on R 1 and R 2 to obtain statistics U 1 and U 2 , and the results are shown in Table 4. Similarly, R 3 and R 4 are the respective rank sums of Samples 3 and 4, respectively, and U 3 and U 4 are obtained by performing the same process; the results are shown in Table 5.
For U 1 and U 3 of Tables 4 and 5, it can be found that U 1 = 27<47 = U α = 0.02 <U α = 0.025 by checking the Mann-Whitney Table. U 1 is lower than U α after Bonferroni correction, U 3 = 0 = U α = 0.05 , so the probability p that Hypothesis 1(b) is true is 0.95. This shows that DSRABSQL algorithm has significant advantages in all scale instances compared with VDWOA. Fig 11 also shows N min as obtained by testing instances kroa100, ch150, d198, and fl417 for 50 times by DSRABSQL and VDWOA under the condition of convergence. The results show that min of DSRABSQL is better than that of VDWOA, and when the algorithms converge, DSRABSQL has a much higher probability of obtaining min.
To further verify the significance difference of DSRABSQL and VDWOA for groups G1 and G2, Vargha and Delaney's Â was used to compute its effect size, and the following results are obtained: Â 21 = 0.86>0.71 and Â 43 = 1>0.71. This means that probability of the DSRABSQL algorithm is better than VDWOA for instances of all scales is about 0.86-1. Combined with the conclusions obtained from Fig 10 and Table 2, it can be further concluded that DSRABSQL has better learning ability than the population-and random search-based VDWOA. The above analysis shows that performance of DSRABSQL is a significant improvement compared with QL, ABSQL, and VDWOA, because DSRABSQL undergoes two stages of improvement based on the QL algorithm. The first stage is from QL to ABSQL, with the proposed stgy _1-stgy_4 strategies, which are all improvement methods within the QL framework. The second stage is from ABSQL to DSRABSQL, with the proposed stgy_5 strategy, which is an improvement method outside the QL framework. We compare the improvement effects of these strategies in what follows.
1. Two groups of comparative experiments are designed. Group 1 includes QL and five comparison algorithms, which are obtained by combining one of the five strategies above with QL, and denoted as QL+1,. . ., QL+4, QL +5. QL+1-QL+4 are improvements within the QL framework, and QL+5 is an improvement outside the QL framework. Group 2 includes DSRABSQL and five comparison algorithms, which are obtained by removing one of the five strategies above from DSRABSQL, and denoted as DSR-1,. . ., DSR-4, DSR-5. DSR-1-DSR-4 are improvements outside the QL framework, and DSR-5 is an improvement within the QL framework. The relationships between the algorithms and the five strategies are shown in Table 6.
3. For each comparison algorithm, according to the position distribution of the boxes shown in these figures, the effects of these strategies on improvement of the algorithms are judged. The values of min, avg and max in the boxplot are converted to S c for processing, and the results are shown in Table 7.
In experimental Group 2, for ch150, the performance of DSR-2, DSR-4, DSR-1, DSR-3, and DSR-5 deteriorates gradually compared with DSRABSQL. For rat575 and fl1400, the performance of DSR-2, DSR-4, DSR-1, DSR-5, and DSR-3 deteriorates gradually compared with DSRABSQL. It can be also found that, with the increase of the instance scale, the gap between DSR-3 and the other four algorithms increases gradually, but the solution quality of DSR-1 and DSR-4 constantly approaches that of DSR-5.

PLOS ONE
The above analysis shows that for small-scale instances, stgy_5 makes the greatest contribution to QL and DSRABSQL, indicating its key role in early and later improvement stages of the algorithm. For medium-and large-scale instances, stgy_5 contributes most to the QL algorithm, and stgy_3 contributes most to DSRABSQL, which indicates that stgy_5 plays a key role in the early improvement stage of the algorithm, and that stgy_3 plays a key role in the later improvement stage. In conclusion, stgy_5 has the greatest potential for small-scale instances, and stgy_3 for medium-and large-scale instances. stgy_1 and stgy_4 also have certain potential for large-scale instances, while stgy_2 has no obvious influence for any scale of instances.
In addition, from the difference of S c of each algorithm in Table 7, it can be found that there are collaborations between strategies. To describe this phenomenon in detail, the following data are calculated based on Table 7. Since stgy_5 is the only one of the five strategies outside the QL framework, and it has the best improvement without the collaboration of other strategies, we use this as the reference strategy. The difference of S c between QL+5 and QL+k (k2 {1,2,3,4}) is denoted as d 5-k , which is calculated as and the difference of S c between DSR-5 and DSR-k is denoted as d´5 -k , which is calculated as , and S c DSR-5 represent the S c value of the corresponding instances of QL+k, QL+5, DSR-k, and DSR-5, respectively. For algorithm QL+k, no collaboration exists between stgy_1~stgy_5, but it does for DSR-k. These algorithms are divided into four classes according to their adopted strategies and the relevant data are listed in Table 8. From Table 8, the following results can be found. For ch150 of class I, S c of QL combined with stgy_1 (QL+1) is 2.14% larger than that of QL combined with stgy_5 (QL+5); however, the S c of QL combined with stgy_1, stgy_2, stgy_3 and stgy_4 (DSR-5) is 1.42% larger than that of QL combined with stgy_2, stgy_3, stgy_4 and stgy_5 (DSR-1), which indicates that with the collaboration of stgy_2, stgy_3 and stgy_4, the difference of the improvement effect between stgy_1 and stgy_5 is decreased gradually. The same is found for classes II-IV. For rat575 and fl1400, similar results can also be deduced. In other words, the improvement effect of stgy_1-stgy_4 is all worse than stgy_5 when no collaboration exists. However, under collaboration, the difference of the improvement effect between stgy_1-stgy_4 and stgy_5 gradually decreases. In addition, the improvement effect of stgy_1 and stgy_3 is sometimes better than that of stgy_5, which indicates a better collaboration effect from stgy_1-stgy_4, but stgy_5 is relatively independent.
In class I, for ch150, the S c of DSR-5 is 1.42% larger than that of DSR-1; for rat575, the S c of DSR-5 is 0.75% larger than that of DSR-1; and for fl1400, the S c of DSR-5 is 0.2% less than that of DSR-1. This all indicates that with the expansion of the instance scale and with the collaboration of stgy_2, stgy_3, and stgy_4, the difference of the improvement effect between stgy_1 and stgy_5 decreases until it overtakes that of stgy_5. The same is found for classes II-IV. Therefore, the collaboration effect between stgy_1-stgy_4 is approaching being better than that of stgy_1-stgy_5's increase with instance scale.
The above analysis shows that DSRABSQL has certain advantages in solving the TSP. To further verify the effectiveness of the algorithm, from categories 1 and 2 described in section 1, we select S2V-DQN and the 2-opt heuristic algorithm combined with deep reinforcement learning (2-opt-DRL for short) from each category to compare with DSRABSQL. Table 9 lists the min and average of S min obtained by DSRABSQL and these two algorithms for 25 instances.
For these 25 instances, compared with S2V-DQN, the results of 23 instances obtained by DSRABSQL are better, and compared with 2-opt-DRL, those of 19 instances are better. In general, the average of S min for DSRABSQL is 0.33 times that of S2V-DQN and 0.18 times that of 2-opt-DRL, which shows that the solution results of DSRABSQL still have a clear advantage compared with those of RL.

Conclusion and future work
We proposed a dynamic sub-route-based self-adaptive beam search Q-learning algorithm for medium-to large-scale TSPs. On the basis of the QL algorithm, a weighting function-based reward matrix, a power function-based initial Q-table, a self-adaptive ε-beam search strategy, a new Q-value update formula, and a dynamic sub-route optimization strategy were introduced. The first four are improvement strategies within the QL framework, and the last is outside of it.
Considering the comparative algorithms, the DSRABSQL algorithm not only has certain advantages in solution time, results, and solution stability, but strong local optimization and good learning abilities. However, there are certain deficiencies compared with the current best heuristic algorithms, such as LKH and DEACO. Therefore, it is necessary to further improve DSRABSQL.
For the five proposed improvement strategies for DSRABSQL, the experimental results show that they can accelerate the convergence speed of the algorithm, enhance its search and local optimization ability, and balance exploration and exploitation. In addition, there are strong collaboration effects among the four strategies within the QL framework, which will become stronger with expansion of the instance scale. Although the dynamic sub-route optimization strategy has the most obvious improvement effect on small-scale instances, its relative independence will lead to a fixed improvement effect, which will be limited as the problem size increases. The self-adaptive ε-beam search strategy cannot only make the greatest contributions to medium-and large-scale problems; it has strong collaboration effects with other strategies within the QL framework, and its advantage can be enhanced with increase of the instance size. Therefore, further improvement to the self-adaptive ε-beam search strategy will be studied. DSRABSQL still faces challenges for solving larger-scale problems. Moreover, the many artificial parameters increase the complexity of parameter adjustment and may result in poor solutions for certain types of problems.
In future work, we hope to improve the self-adaptive ε-beam search strategy, enhance the collaboration effects between strategies, and adjust the parameters of each strategy to find a robust QL framework to improve the solution quality of large-scale problems and obtain a model with a better solution capability.